This notebook analyses Netflix viewing activity.
The data can be requested from Netflix and is received in CSV format. These files are uploaded here and summarised.
#Import relevant libraries
import os
import pandas as pd
import numpy as np
import datetime
import seaborn as sns
import matplotlib.pyplot as plt
from pprint import pprint
from IPython.display import Markdown as md
%matplotlib inline
#Widen print screen
pd.options.display.width = 125
The following function searches a directory for files whose names contain a given string and returns a list of all matches. It takes two inputs:
dir: The directory in which to search.
name: The string to search for in file names.
#Specify the directory in which the CSV files reside
directory = 'C:\\Users\\rickp\\Documents\\projects\\Data\\netflix-report\\CONTENT_INTERACTION\\'
#Function to find all files in the specified directory containing the specified text (name)
def list_files(dir, name, ext="csv"):
    ''' Lists all files which contain the specified 'name' in the specified directory '''
    #Get list of files in directory which contain the specified name
    lst_files = [k for k in os.listdir(dir) if name.lower() in k.lower()]
    #Only keep files with the specified extension (defaults to CSV) and return from function
    return [item for item in lst_files if item.split(".")[-1] == ext]
lst_files = list_files(directory,'viewingactivity')
lst_files
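For reference, the same search could be written with pathlib; this is just an alternative sketch (the throwaway directory and file names below are purely illustrative), not what the notebook uses:

```python
import tempfile
from pathlib import Path

def list_files_pathlib(dir, name, ext="csv"):
    """Same idea as list_files above, using pathlib instead of os.listdir."""
    return [p.name for p in Path(dir).iterdir()
            if name.lower() in p.name.lower() and p.suffix.lower() == "." + ext]

# Quick demo against a throwaway directory with made-up file names
with tempfile.TemporaryDirectory() as d:
    for f in ["ViewingActivity1.csv", "ViewingActivity2.csv", "BillingHistory.csv", "notes.txt"]:
        (Path(d) / f).touch()
    print(sorted(list_files_pathlib(d, "viewingactivity")))
    # ['ViewingActivity1.csv', 'ViewingActivity2.csv']
```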
The following function loops through a list of file names, reads each file, and concatenates them all into a single dataframe. It takes two inputs:
dir_str: The directory in which to search.
file_list: The list of file names to read.
In this case it uses the same directory as above and the list of file names produced by list_files.
NOTE: The files should all have the same columns.
#Function to extract files from a directory and concatenate into a single data frame
def concatenate_files(dir_str, file_list):
    merged = []
    columns = {}
    print("Attempting to read the following", len(file_list), "files from", dir_str)
    print(file_list)
    for n, file in enumerate(file_list):
        dir_file = dir_str + file
        print("\nFile ", n+1, ") Attempting to read: ", file, sep="")
        try:
            read_file = pd.read_csv(dir_file, parse_dates=['Start Time'])
            print("Number of rows and columns read in:", read_file.shape)
            columns[n] = sorted(list(read_file.columns))
        except Exception as e:
            print("Error with read. File skipped:", e)
            continue
        #Only concatenate if columns are the same as the first file read in
        if columns[0] == columns[n]:
            merged.append(read_file)
            print("File added to main dataframe")
        else:
            print("*** WARNING:", file, "does not have the same columns as the first file so can't concatenate ***")
    concatenated_data = pd.concat(merged)
    print("\nLoop complete. Total number of rows in final dataframe:", f'{concatenated_data.shape[0]:,}')
    concatenated_data.sort_values(by=['Start Time'], ascending=False, inplace=True)
    concatenated_data = concatenated_data.reset_index(drop=True)
    return concatenated_data
#Run function. lst_files is the output from the list_files function
df_netflix = concatenate_files(directory,lst_files)
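Incidentally, the column check inside concatenate_files matters because pd.concat will happily stack frames with different columns: it takes the union of the columns and fills the gaps with NaN. A minimal sketch on made-up frames:

```python
import pandas as pd

# Two toy frames sharing only the 'Title' column
a = pd.DataFrame({"Title": ["X"], "Duration": ["00:30:00"]})
b = pd.DataFrame({"Title": ["Y"], "Country": ["GB (United Kingdom)"]})

both = pd.concat([a, b], ignore_index=True)
print(both.columns.tolist())    # ['Title', 'Duration', 'Country']
print(int(both.isna().sum().sum()))  # 2 -- one missing Duration, one missing Country
```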
Now that the CSV files have been successfully imported and concatenated into a single dataframe called 'df_netflix', let's take a look at what it contains.
df_netflix.info()
FIELD DEFINITIONS
Profile Name: the name of the profile in which viewing occurred
Duration: the length of the viewing session
Title: the TV show or movie viewed
Supplemental Video Type: videos other than a TV show or a movie (such as trailers)
Attributes: whether the title was autoplayed
Device Type: the device on which the title was viewed
'Bookmark', 'Latest Bookmark', and 'Attributes' are not required for analysis so we'll delete them and then take a look at the top 5 rows of the data.
#Remove fields which are not useful
df_netflix = df_netflix.drop(['Bookmark', 'Latest Bookmark', 'Attributes'], axis=1)
df_netflix.head()
First, let's look at 'Supplemental Video Type'. This shows whether the video was a trailer, preview, etc. The majority of entries are missing (NaN), so we can assume these represent a film or series.
print(df_netflix['Supplemental Video Type'].value_counts(dropna = False))
We're only going to keep viewings of 5 minutes or longer, which should remove most of these populated entries.
First we need to create a new numeric field called 'Duration in Minutes', because 'Duration' is a string in Hours:Minutes:Seconds format. All viewings of less than 5 minutes will then be dropped, and we'll check 'Supplemental Video Type' again to make sure it only contains missing values.
#Split duration into HH:MM and create duration in minutes (HH * 60 + MM)
df_netflix['Duration in Minutes'] = df_netflix['Duration'].str.split(':').apply(lambda x: int(x[0]) * 60 + int(x[1]))
#Drop rows with a duration of less than 5 minutes
df_netflix = df_netflix[df_netflix['Duration in Minutes'] >= 5]
#Make sure it has worked
print("Double check to ensure only NaN have been kept:\n",df_netflix['Supplemental Video Type'].value_counts(dropna = False),sep="")
#And then delete the field entirely as no longer required.
df_netflix.drop(['Supplemental Video Type'], axis =1, inplace= True)
Excellent! Only missing values have been kept. We can now analyse viewing trends knowing we're only looking at genuine films or TV programmes viewed for 5 minutes or longer.
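As an aside, pandas can parse 'HH:MM:SS' strings directly with to_timedelta, which would avoid the manual split (and keeps the seconds, which the split truncates). A sketch on made-up durations:

```python
import pandas as pd

durations = pd.Series(["0:04:59", "1:23:45"])  # toy 'Duration' values
minutes = pd.to_timedelta(durations).dt.total_seconds() / 60
print(minutes.round(2).tolist())  # [4.98, 83.75]
```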
Now let's look at who has been using Netflix, after cleaning up the 'Profile Name' column slightly. Some profiles are going to be dropped, some renamed, and some viewings re-assigned based on what was watched:
#First, remove leading and trailing blanks from profile name
df_netflix['Profile Name'] = df_netflix['Profile Name'].str.strip()
#Drop some profiles
drop_profiles = ["D&D","M&T","Kids"]
columns_before_drop = df_netflix.shape[0]
df_netflix = df_netflix[~df_netflix['Profile Name'].isin(drop_profiles)].reset_index(drop=True)
#Clean & rename some profiles
df_netflix['Profile Name'] = df_netflix['Profile Name'].replace(
    {"R&G": "Rosie and Greg", "Dad": "Brian", "Mom": "Jenny"})
#Re-assign some profiles depending on viewing
conditions = [
(df_netflix['Profile Name'] == "Maria and Rick") & (df_netflix['Title'].str.contains("The Haunting")),
(df_netflix['Profile Name'] == "Maria and Rick") & (df_netflix['Title'].str.contains("Gilmore")),
(df_netflix['Profile Name'] == "Maria and Rick") & (df_netflix['Title'].str.contains("Mamma Mia! Here We Go Again")),
(df_netflix['Profile Name'] == "Maria and Rick") & (df_netflix['Title'].str.contains("Gossip Girl")),
(df_netflix['Profile Name'] == "Maria and Rick") & (df_netflix['Title'].str.contains("You:"))
]
choices = [
"Rick",
"Maria",
"Maria",
"Maria",
"Maria"
]
#Apply profile re-assign
df_netflix['Profile Name'] = np.select(conditions, choices, default=df_netflix['Profile Name'])
#Display frequency and % by profile name
print("\n(NOTE: The profiles",
drop_profiles,
"have been deleted which has removed", columns_before_drop - df_netflix.shape[0], "rows)\n")
c = df_netflix['Profile Name'].value_counts(dropna=False)
p = df_netflix['Profile Name'].value_counts(dropna=False, normalize=True).mul(100).round(1).astype(str) + '%'
print(pd.concat([c,p], axis=1, keys=["Titles Watched", '%']))
most_viewings = c.take([0]).to_frame().reset_index()
most_viewings_p = p.take([0]).to_frame().reset_index()
md("<br><mark>{}</mark> has watched the most titles with {}.<br> This is {} of the total population.".format(most_viewings['index'][0], most_viewings['Profile Name'][0], most_viewings_p['Profile Name'][0]))
Now we're going to clean up the 'Country' field by removing the country code and also reallocating to the correct country when we know a VPN has been used.
country_before = list(df_netflix.Country.unique())
print("List of Countries in the data BEFORE cleansing:")
pprint(country_before)
#Clean Country field to remove code and brackets
df_netflix['Country'] = df_netflix['Country'].str.split("(",expand=True)[1].str.strip().str[0:-1].str.strip()
#Clean known VPN usage using np.select rather than the apply function (it is quicker)
conditions = [
(df_netflix['Profile Name'] == 'Dylan') & (df_netflix['Country'] == "United States"),
(df_netflix['Profile Name'] == 'Greg') & (df_netflix['Country'] == "Canada")]
choices = [
"United Kingdom",
"United Kingdom"]
#Assign VPN clean up to data
df_netflix['Country'] = np.select(conditions, choices, default=df_netflix['Country'])
country_after = list(df_netflix.Country.unique())
print("\nList of Countries in the data AFTER cleansing (including reassigning VPN usage so some countries will disappear):")
pprint(country_after)
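As an aside, the chained split/strip used above could also be written as a single str.extract with a regex capturing the text inside the parentheses; a sketch on made-up values:

```python
import pandas as pd

raw = pd.Series(["GB (United Kingdom)", "US (United States)"])  # toy 'Country' values
cleaned = raw.str.extract(r"\(([^)]+)\)")[0].str.strip()
print(cleaned.tolist())  # ['United Kingdom', 'United States']
```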
OK. That looks much better.
Finally for the data cleansing, we're going to reassign 'Device Type' to more meaningful groups. There is a whole range of different device types, which we'll group into Computer, TV, Phone, Tablet, & Other.
#Group into TV, Phone, Tablet, Computer, Other
#NOTE: check the regex carefully! These are 'contains' matches rather than exact matches
conditions = [
df_netflix['Device Type'].str.contains('TV|Stick|BD Recorder|Sky|Xbox|Playstation|Theater', case=False, regex=True),
df_netflix['Device Type'].str.contains('Phone|Mobile', case=False, regex=True),
df_netflix['Device Type'].str.contains('Tablet|ipad', case=False, regex=True),
df_netflix['Device Type'].str.contains('PC|OS|Edge|Mac|Chrome|Windows', case=False, regex=True)]
choices = [
"TV",
"Phone",
"Tablet",
"Computer"]
#Apply new device groups
df_netflix['Device Type Group'] = np.select(conditions, choices, default="Other")
#Check they are assigned properly:
pd.crosstab(df_netflix['Device Type'], df_netflix['Device Type Group'], margins=True, margins_name="Total")
OK! Everything has been assigned to a reasonable group and nothing has fallen into the 'Other' category.
We also want to reassign Phone to TV where the viewing lasted over 60 minutes. This is purely an assumption that the phone was probably used to cast to a TV rather than being watched on the phone screen.
#Reassign Phone to TV if used for more than an hour
print("BEFORE phone being reassigned to TV:")
before_TV_PHONE = df_netflix.loc[df_netflix['Device Type Group'].isin(['Phone', 'TV'])]
print(before_TV_PHONE['Device Type Group'].value_counts())
def update_phone_tv(row):
    if row['Device Type Group'] == "Phone" and row['Duration in Minutes'] > 60:
        return "TV"
    else:
        return row['Device Type Group']
df_netflix['Device Type Group'] = df_netflix.apply(update_phone_tv, axis = 'columns')
print("\nAFTER phone being reassigned to TV:")
after_TV_PHONE = df_netflix.loc[df_netflix['Device Type Group'].isin(['Phone', 'TV'])]
print(after_TV_PHONE['Device Type Group'].value_counts())
As can be seen above, a small number of phone viewings have been reassigned to TV.
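Since vectorised operations are preferred elsewhere in this notebook (see the np.select usage), the same reassignment could be done without apply, using a boolean mask; a sketch on a toy frame:

```python
import pandas as pd

# Toy data standing in for df_netflix
toy = pd.DataFrame({"Device Type Group": ["Phone", "Phone", "TV"],
                    "Duration in Minutes": [90, 30, 45]})

# Phone viewings over an hour become TV
mask = (toy["Device Type Group"] == "Phone") & (toy["Duration in Minutes"] > 60)
toy.loc[mask, "Device Type Group"] = "TV"
print(toy["Device Type Group"].tolist())  # ['TV', 'Phone', 'TV']
```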
The final part of data cleansing is to remove duplicates. For some reason the data shows more than one entry for the same profile watching the same title at the same time! This only happens a small number of times.
#Sort data
df_netflix = df_netflix.sort_values(by = ['Profile Name', 'Start Time', 'Duration'], ascending = False)
#Remove the row if the previous row has the same profile and start date (i.e. start time)
before_dups = df_netflix.shape[0]
print("Number of rows before removing duplicates:",f'{before_dups:,}')
df_netflix = df_netflix.drop(df_netflix[(df_netflix['Profile Name'].shift(1) == df_netflix['Profile Name'])
& (df_netflix['Start Time'].shift(1) == df_netflix['Start Time'])].index).reset_index(drop= True)
print("Number of rows after removing duplicates:",f'{df_netflix.shape[0]:,}')
print("Duplicates removed:",f'{before_dups - df_netflix.shape[0]:,}')
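For comparison: because the frame was just sorted with the longest duration first within each (Profile Name, Start Time) pair, pandas' built-in drop_duplicates achieves the same result as the shift-based filter. A sketch on a toy frame:

```python
import pandas as pd

# Toy data: two entries share the same profile and start time
toy = pd.DataFrame({"Profile Name": ["Rick", "Rick", "Rick"],
                    "Start Time": ["2021-01-01 20:00", "2021-01-01 20:00", "2021-01-01 21:00"],
                    "Duration": ["00:45:00", "00:05:00", "00:30:00"]})

# Same sort as above: longest duration first within each profile/start time
toy = toy.sort_values(["Profile Name", "Start Time", "Duration"], ascending=False)
deduped = toy.drop_duplicates(subset=["Profile Name", "Start Time"], keep="first")
print(deduped["Duration"].tolist())  # ['00:30:00', '00:45:00'] -- the 5-minute duplicate is gone
```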
The data has now been well and truly cleansed! Let's start analysing!
Let's start with some basic counts.
What devices have been used?
#Summarise viewings by device type
df_device_info = df_netflix.groupby('Device Type Group')['Duration in Minutes'].agg(['count', 'sum', 'mean'])
#Bar chart of viewings by duration
df_device_info.iloc[:,1].plot(kind='bar',
figsize=(13,6),
color='lightblue',
title="Total duration watched by device type")
plt.ylabel('Duration in Minutes')
plt.show()
#Bar chart of viewings by average duration
df_device_info.iloc[:,2].plot(kind='bar',
figsize=(13,6),
color='lightblue',
title="Average duration watched by device type")
plt.ylabel('Duration in Minutes')
plt.show()
TV has been used the most by far, but surprisingly the longest average time per individual title viewed is on the computer.
Now let's see who's using Netflix the most.
#Summarise viewings by profile
df_profile_info = df_netflix.groupby('Profile Name')['Duration in Minutes'].agg(['count', 'sum', 'mean'])
#Bar chart of viewings by duration
df_profile_info = df_profile_info.sort_values('sum', ascending = False)
df_profile_info.iloc[:,1].plot(kind='bar',
figsize=(13,6),
color='lightblue',
title="Total duration watched by profile")
plt.ylabel('Duration in Minutes')
plt.show()
#Bar chart of viewings by average duration
df_profile_info = df_profile_info.sort_values('mean', ascending = False)
df_profile_info.iloc[:,2].plot(kind='bar',
figsize=(13,6),
color='lightblue',
title="Average duration watched by profile")
plt.ylabel('Duration in Minutes')
plt.show()
Looks like Greg watches a lot of short shows!
Now let's check out which country people are viewing in:
#Summarise viewings by country
df_country_info = df_netflix.groupby('Country')['Duration in Minutes'].agg(['count', 'sum', 'mean'])
#Bar chart of viewings by duration
df_country_info = df_country_info.sort_values('sum', ascending = False)
df_country_info.iloc[:,1].plot(kind='bar',
figsize=(13,6),
color='lightblue',
title="Total duration watched in each country")
plt.ylabel('Duration in Minutes')
plt.show()
Unsurprisingly, most of the viewings in our population are in the UK, but a few other countries appear too. Let's find out who was watching there.
#Show cross tab of profile name by country to see where people have been watching Netflix
print("\nCount of individual viewings per Country by profile")
pd.crosstab(df_netflix['Country'], df_netflix['Profile Name'], margins=True, margins_name="Total")
Now let's see how long people are watching for.
# Create duration grouped column using NUMPY SELECT function (vectorised) -- WAY FASTER THAN APPLY!!!!#
#Minute intervals (e.g. 15 = 15 minute intervals)
groupby_min = 15
#How many groups required?
groups = 10
#The final band ends at groups * groupby_min minutes
stop_at_min = groups*groupby_min
#Create automatic bandings of minutes ("Duration in Minutes") depending on info entered above
conditions = []
choices = []
for n, i in enumerate(range(groupby_min, stop_at_min + 1, groupby_min)):
    conditions.append(df_netflix['Duration in Minutes'] <= i)
    choices.append(str(n) + ") " + str(i - groupby_min) + " to " + str(i) + " minutes")
df_netflix['Duration Grouped'] = np.select(conditions , choices, default=">"+str(stop_at_min)+' minutes')
df_netflix['Duration Grouped'].value_counts().sort_index().plot(kind='bar',
figsize=(13,6),
color='lightblue',
title="Frequency of viewings by duration")
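An alternative to building the conditions/choices lists by hand is pd.cut, which bins a numeric column into ordered categories directly; a minimal sketch with the same 15-minute bands (the toy values, bin edges, and labels below are illustrative):

```python
import pandas as pd

mins = pd.Series([10, 44, 200])  # toy durations in minutes
bins = list(range(0, 151, 15)) + [float("inf")]  # 0-15, 15-30, ..., 135-150, >150
labels = [f"{b} to {b + 15} minutes" for b in range(0, 150, 15)] + [">150 minutes"]
banded = pd.cut(mins, bins=bins, labels=labels)
print(banded.tolist())  # ['0 to 15 minutes', '30 to 45 minutes', '>150 minutes']
```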
TV shows up to 30 minutes long are the most frequently viewed.
There are a small number of films with a duration of over 135 minutes (2 and a quarter hours).
Let's see what these were and who watched the longest one!
dataset = df_netflix.loc[(df_netflix['Duration in Minutes'] >=135),['Profile Name', 'Title', 'Device Type Group', 'Duration']].sort_values(by = 'Duration', ascending = False).reset_index(drop= True)
display(dataset)
longest_viewing = dataset.loc[0,['Profile Name', 'Title', 'Duration']]
md("<br>It looks like the crown for viewing the longest film goes to <mark>{}</mark>!<br>It took {} to watch '{}'.".format(longest_viewing['Profile Name'], longest_viewing['Duration'], longest_viewing['Title']))
Let's now see when people are watching the most.
## Create date variables for graphs ##
#Weekday
cats_days = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
df_netflix['Weekday'] = df_netflix['Start Time'].dt.strftime('%A').astype('category').cat.set_categories(cats_days)
#Month
cats_month = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
df_netflix['Month'] = df_netflix['Start Time'].dt.strftime('%b').astype('category').cat.set_categories(cats_month)
#Year, and Year Month
df_netflix['Year'] = df_netflix['Start Time'].dt.strftime('%Y')
df_netflix['Year Month'] = df_netflix['Start Time'].dt.strftime('%Y %m')
#Add hour of day & some other time related columns for analysis later
cats_hour = ['00','01','02','03','04','05','06','07','08','09','10','11',
'12','13','14','15','16','17','18','19','20','21','22','23']
df_netflix['Hour of Day'] = df_netflix['Start Time'].dt.strftime("%H").astype('category').cat.set_categories(cats_hour)
#Count viewings by weekday and plot
df_weekday_dur = df_netflix.groupby('Weekday')['Duration in Minutes'].agg(['count', 'sum', 'mean']).reset_index()
fig, (ax1, ax2) = plt.subplots(1, 2)
#Count
pd1 = df_weekday_dur.plot.bar(x = "Weekday",
y = "count",
ax = ax1,
figsize = (17,6),
title = "COUNT of viewings by Weekday",
color = "lightblue",)
#Average
pd2 = df_weekday_dur.plot.bar(x = "Weekday",
y = "mean",
ax = ax2,
figsize = (17,6),
title = "AVERAGE duration (minutes) by Weekday",
color = "lightblue")
plt.show()
df_most_viewing = df_weekday_dur.loc[df_weekday_dur['count'].idxmax(), 'Weekday']
df_least_viewing = df_weekday_dur.loc[df_weekday_dur['count'].idxmin(), 'Weekday']
df_most_dur_day = df_weekday_dur.loc[df_weekday_dur['mean'].idxmax(), 'Weekday']
df_most_dur = round(df_weekday_dur.loc[df_weekday_dur['mean'].idxmax(), 'mean'], 2)
df_least_dur_day = df_weekday_dur.loc[df_weekday_dur['mean'].idxmin(), 'Weekday']
df_least_dur = round(df_weekday_dur.loc[df_weekday_dur['mean'].idxmin(), 'mean'], 2)
display(md("The day with the most individual viewings is **{}** and the day with the least is **{}**.".format(df_most_viewing,df_least_viewing)))
display(md("However, on average the longest duration per individual viewing takes place on **{}** ({} minutes) and the shortest on **{}** ({} minutes).<br><br>".format(df_most_dur_day,df_most_dur,df_least_dur_day,df_least_dur)))
display(md("Interesting. Now let's see what time of day people start to watch.<br>"))
#Count viewings by hour of day and plot
df_time_count = df_netflix.groupby('Hour of Day')['Duration in Minutes'].agg(['mean', 'count']).reset_index()
df_time_count.plot.bar(x = "Hour of Day",
y = 'count',
legend = False,
figsize = (17,6),
title = "Number of viewings by hour of day",
color = "lightblue")
plt.xlabel('Hour of Day')
plt.show()
df_most_count = df_time_count.loc[df_time_count['count'].idxmax(), 'Hour of Day']
df_least_count = df_time_count.loc[df_time_count['count'].idxmin(), 'Hour of Day']
display(md("The most popular time to start watching is **{}** and the least popular is **{}**.<br><br>".format(df_most_count,df_least_count)))
df_time_count.plot.bar(x = "Hour of Day",
y = 'mean',
legend = False,
figsize = (17,6),
title = "Average duration of viewings by hour of day",
color = "lightblue")
plt.xlabel('Hour of Day')
plt.show()
df_most_dur = df_time_count.loc[df_time_count['mean'].idxmax(), 'Hour of Day']
df_least_dur = df_time_count.loc[df_time_count['mean'].idxmin(), 'Hour of Day']
display(md("The average viewing duration is highest for shows which start at **{}** and the lowest duration is **{}**.".format(df_most_dur,df_least_dur)))
display(md("The higher average duration later in the day is probably due to more films being watched. 4am has the longest duration!? Perhaps this is down to the low volume of viewing at that time, and those who did watch put on a film?"))
Which month and year had the highest average viewing times?
plt.figure(figsize=(10,5))
df_tot_dur = df_netflix.groupby(['Year','Month'])['Duration in Minutes'].agg('sum').reset_index().sort_values(by = ['Year','Month'])
df_tot_dur = df_tot_dur.pivot(index="Month", columns="Year", values="Duration in Minutes")
sns.heatmap(df_tot_dur, cmap="Blues")
plt.title("Total duration in minutes\n", fontsize =15)
plt.yticks(rotation=0)
plt.show()
plt.figure(figsize=(10,5))
df_avg_duration = df_netflix.groupby(['Year','Month'])['Duration in Minutes'].agg('mean').reset_index().sort_values(by = ['Year','Month'])
df_avg_duration = df_avg_duration.pivot(index="Month", columns="Year", values="Duration in Minutes")
sns.heatmap(df_avg_duration, cmap="Blues", annot = True)
plt.title("Average duration per viewing in minutes", fontsize =15)
plt.show()
This shows our population is heavily weighted towards recent years, with most viewings occurring from 2019 onwards. The higher average viewing time in the earlier years will be down to a smaller number of profiles. Let's check that out. The first scatter plot is for the whole population, and the following few are for a selection of profiles.
#Count the number of titles watched by month and hour of day for population
srs_all_dates_base = df_netflix.groupby(['Year Month', 'Hour of Day']).Title.agg('count').reset_index()
srs_all_dates_base.sort_values(by = ['Year Month', 'Hour of Day'], inplace = True)
#Create scatter plot like a heat map for the whole population
plt.figure(figsize=(25, 7))
plt.scatter(x = srs_all_dates_base['Year Month'],
y = srs_all_dates_base['Hour of Day'].astype(int),
c = srs_all_dates_base['Title'],
cmap='Blues')
plt.title("Time when viewings began - WHOLE POPULATION. The darker the dot the more viewings were started in that hour")
plt.ylabel('Hour of Day')
plt.xlabel('Year / Month')
plt.yticks(np.arange(0, 24, step=1))
plt.xticks(rotation=90)
plt.colorbar();
plt.show()
#Count the number of titles watched by month and hour of day for an individual profile
for name in ['Greg', 'Dylan', 'Rick', 'Jenny']:
    df_filtered = df_netflix.loc[df_netflix['Profile Name'].str.contains(name)]
    df_month_by_hour = df_filtered.groupby(['Year Month','Hour of Day']).Title.agg('count').reset_index()
    #Join population and individual counts together
    srs_all_dates = pd.merge(srs_all_dates_base,
                             df_month_by_hour,
                             on = ['Year Month','Hour of Day'],
                             how = 'left',
                             suffixes=('_pop','_individual'))
    #Create scatter plot like a heat map. Change 'c = ' to Title_pop or Title_individual to show the whole population or the named individual
    plt.figure(figsize=(25, 7))
    plt.scatter(x = srs_all_dates['Year Month'],
                y = srs_all_dates['Hour of Day'].astype(int),
                c = srs_all_dates['Title_individual'],
                cmap='Blues')
    plt.title("Time when viewings began for " + name)
    plt.ylabel('Hour of Day')
    plt.xlabel('Year / Month')
    plt.yticks(np.arange(0, 24, step=1))
    plt.xticks(sorted(df_netflix['Year Month'].unique()), rotation=90)
    plt.colorbar()
    plt.show()
The per-profile scatter plots show that anything watched in 2016 or earlier is pretty much entirely down to Greg's account. Let's double-check that by finding the earliest viewing date on every profile.
#Earliest date per profile
earliest_prof = df_netflix.groupby('Profile Name')['Start Time'].agg('min').sort_values().dt.strftime('%d %b %Y').reset_index()
display(md("The earliest viewing is by **{}** on **{}**.".format(earliest_prof['Profile Name'][0],earliest_prof['Start Time'][0])))
display(md("The next profile to begin viewing is **{}** on **{}**.".format(earliest_prof['Profile Name'][1],earliest_prof['Start Time'][1])))
earliest_prof
Let's check out what time on each weekday people are watching.
#Loop per name in list and create two heatmaps. Total count of shows watched and average viewing time
for name in df_netflix['Profile Name'].unique():
    #Create 2 plots side by side
    fig, axes = plt.subplots(1, 2, figsize=(22,6), dpi=300)
    #Filter data per name
    df_filtered = df_netflix.loc[df_netflix['Profile Name'].str.contains(name)]
    #Calculate MEAN and COUNT by weekday and hour of day
    df_day_time = df_filtered.groupby(['Weekday','Hour of Day'])['Duration in Minutes'].agg(['mean', 'count']).reset_index().sort_values(by = ['Weekday','Hour of Day'], ascending = False)
    #Turn both MEAN and COUNT into pivot tables. Reindexing with the categories created earlier preserves the heatmap axes even when no data is present
    df_day_time_mean_piv = df_day_time.loc[df_day_time['count'] >= 3].pivot(index="Weekday", columns="Hour of Day", values="mean")
    df_day_time_mean_piv = df_day_time_mean_piv.reindex(cats_hour, axis = 1)
    df_day_time_mean_piv = df_day_time_mean_piv.reindex(cats_days, axis = 0)
    df_day_time_count_piv = df_day_time.pivot(index="Weekday", columns="Hour of Day", values="count")
    df_day_time_count_piv = df_day_time_count_piv.reindex(cats_hour, axis = 1)
    df_day_time_count_piv = df_day_time_count_piv.reindex(cats_days, axis = 0)
    #Create COUNT heatmap and assign to left plot (axes 0)
    sns.heatmap(df_day_time_count_piv, cmap="viridis", linewidths=.5, annot = True, ax = axes[0])
    axes[0].title.set_text('Count of shows started at specified time')
    axes[0].set_ylabel('')
    #Create AVERAGE viewing time heatmap and assign to right plot (axes 1)
    sns.heatmap(df_day_time_mean_piv, cmap="viridis", linewidths=.5, ax = axes[1])
    axes[1].title.set_text('Average viewing time in minutes (for cells with 3 or more shows watched)')
    axes[1].set_ylabel('')
    #Add a single title above the two plots
    fig.suptitle(name.upper()+"'s heatmaps", fontsize = 15, weight = 800)
    plt.show()
    #Uncomment this print to see the titles watched in individual cells. Just change Hour and Weekday to the required values
    #print(df_netflix.loc[(df_netflix['Profile Name'].str.contains(name)) & (df_netflix['Hour of Day'] == "06") & (df_netflix['Weekday'] == "Tuesday")])
Let's see who uses which device. The following table shows the % usage of each device per profile (each column adds up to 100%).
pd.crosstab(df_netflix['Device Type Group'], df_netflix['Profile Name'], margins = True, margins_name="Total", normalize="columns").mul(100).round(1).astype(str) + '%'
Greg is a bit of an all-rounder, whereas Steve likes his computer.
We've seen who watched the titles with the longest duration, but let's find out who binge-watches things all in one go! This shows who watched the most in one day (up to 5am the following morning).
#Anything up to 5am the following morning counts as the previous day for the purpose of binge watching
def binge(row):
    if row['Hour of Day'] in ("00", "01", "02", "03", "04"):
        return row['Start Time'] + pd.offsets.Day(-1)
    else:
        return row['Start Time']
#Apply binge function and use Binge Day to measure duration
df_netflix['Binge Day'] = df_netflix.apply(binge, axis = 'columns')
df_netflix['Binge Day'] = df_netflix['Binge Day'].dt.strftime("%d/%m/%Y")
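For what it's worth, the early-hours shift could also be vectorised instead of using apply; a sketch on made-up timestamps:

```python
import pandas as pd

starts = pd.Series(pd.to_datetime(["2021-06-01 02:30", "2021-06-01 21:00"]))  # toy timestamps
early = starts.dt.hour < 5                                   # viewings before 5am...
shift = pd.to_timedelta(early.astype(int), unit="days")      # ...get shifted back one day
binge_day = (starts - shift).dt.strftime("%d/%m/%Y")
print(binge_day.tolist())  # ['31/05/2021', '01/06/2021']
```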
#Sum duration by binge day
df_binge = df_netflix.groupby(['Profile Name', 'Binge Day'])['Duration in Minutes'].agg(['count', 'sum', 'mean']).sort_values(by = 'sum', ascending = False).reset_index()
#Top binge day per profile. Format and rename then print
binge_top10 = df_binge.groupby('Profile Name').apply(lambda df: df.iloc[0,:]).sort_values(by = 'sum', ascending = False).reset_index(drop = True)
binge_top10['mean'] = binge_top10['mean'].round(2)
binge_top10.rename(columns = {'count': 'Num shows watched', 'sum': 'Total minutes', 'mean': 'Avg minutes per show'}, inplace = True)
print("These are the top binge days per profile:\n\n",binge_top10)
#Highlight and print the top binger!
binge_max = df_binge.loc[df_binge['sum'].idxmax(),['Profile Name', 'Binge Day', 'sum']]
display(md("Our king binger is <mark>{}</mark>! **{}** hours of Netflix were watched on **{}**.".format(binge_max[0], (binge_max[2]/60).round(2), binge_max[1])))
#Print binge day per profile
for index, row in binge_top10.iterrows():
    print("\nBinge day viewing for", row['Profile Name'], "on", row['Binge Day'], ":")
    display(df_netflix.loc[(df_netflix['Profile Name'] == row['Profile Name']) & (df_netflix['Binge Day'] == row['Binge Day']), ['Profile Name', 'Title', 'Start Time', 'Duration in Minutes']].sort_values(by = 'Start Time').style.hide(axis="index"))
What are the top 20 most watched titles (based on duration) and who has watched them?
most_watched_ordered = df_netflix.groupby(['Profile Name', 'Title'])['Duration in Minutes'].agg('sum').reset_index()
#most_watched = most_watched_ordered.pivot("Title", "Profile Name", "Duration in Minutes")
most_watched = most_watched_ordered.pivot_table(index=['Title'], values=['Duration in Minutes'], columns=['Profile Name'], aggfunc='sum', margins = True, margins_name = 'Total')
most_watched = most_watched.reindex(most_watched['Duration in Minutes'].sort_values(by='Total', ascending=False).index).fillna(0).drop('Total')
most_watched.head(20)
And finally, let's see what films people are watching around Christmas time.
(Let's assume anything with a duration over 75 minutes is a film.)
#We're going to look at every Christmas from the earliest date in the data until the current year
#Create a list of the years in this time frame
lst_avail_dates = [x for x in range(df_netflix['Start Time'].min().year, datetime.datetime.now().year)]
display(md("### <U>Films watched at <font color='red'>Christmas</font> and New Year</U>"))
display(md("Checking for films watched between 22nd December and 2nd January<br><br>"))
#Function to return data between two dates with a duration >= 'dur'
def select_time_period(from_date, to_date, dur):
    df_time = df_netflix.loc[(df_netflix['Duration in Minutes'] >= dur) &
                             (df_netflix['Start Time'].between(from_date, to_date)),
                             ['Start Time', 'Title', 'Profile Name', 'Duration']].sort_values(by = 'Start Time', ascending = True)
    display(md("**{}**".format(from_date[0:4])))
    if df_time.empty:
        print("No movies watched by anyone this year\n")
    else:
        #Get rid of index for print
        df_time.index = [''] * len(df_time)
        print(df_time)
#Loop over all years in the list and call the function
for year in lst_avail_dates:
    select_time_period(str(year)+'-12-22', str(year+1)+'-01-02', 75)